Week #2 Exercise Questions

Today's Session Outline:-

  1. Data Loading
    • textFile
    • parallelize
    • from Master in a distributed environment (Week 3)
  2. MapReduce
    • transformation & actions
      • map & collect
      • filter & count
      • flatMap & take
      • filter & reduce
      • union, intersection
      • join, leftOuterJoin, cartesian, cogroup
      • reduceByKey & collect
  3. Config
    • using external files (Week 3)
    • using API
    • Spark Session (discussed with Spark SQL)
  4. Partitions
    • fill it with random data
    • find/use the partitions
  5. Spark SQL
    • Basic operations

Things to remember:-

  • Know your hardware configuration: RAM, CPU cores, etc.
  • All Spark functions will be given along with the questions; you have to fill in each Spark function with its respective parameters
    and write the corresponding Scala or Python logic
  • This is a practice session, so no scores are calculated
  • For quicker programming, we will use the shell environment today
  • If your IDE configurations aren't working, approach us after the exercise session
  • Of course, solutions will be provided for these questions after the exercise session
  • If you are already familiar with the contents listed above, go ahead and start with Spark SQL

Configuration:-

https://github.com/shathil/BigDataExercises

From the above URL, download the Git repository as a ZIP file. From the downloaded ZIP package, extract the folder called resources and place it inside your Spark 2.x folder.

1. Data Loading

Question 1

  • In the Spark shell, the SparkContext object is already initialized and available as sc.
  • Now load a "textFile" named 'pg1567.txt' from the resources folder into an RDD named textRDD
  • Use 'resources/pg1567.txt' as the path when loading the text file.
  • Now transform textRDD with "map" and split each line into words (use a space as the delimiter).
  • Finally, count the number of words using "count" (a sketch of one possible answer follows the function list)

Spark functions needed:-

  • textFile
  • map
  • count
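
A minimal sketch of one possible answer, assuming the shell's pre-initialized sc and that the file sits at resources/pg1567.txt (adjust the name to whatever is in your resources folder):

In [ ]:
# One way to answer Question 1 (PySpark shell, `sc` is already available).
textRDD = sc.textFile('resources/pg1567.txt')          # RDD of lines
wordsRDD = textRDD.map(lambda line: line.split(' '))   # each element becomes a list of words
print(wordsRDD.count())                                # map keeps one element per line; flatMap would count individual words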

Question 2

  • Create a Scala array or a Python list containing any 5 numbers and store it in a variable called "numbers"
  • Use the Spark "parallelize" function on the numbers variable and store the resulting RDD as numRDD
  • Now transform numRDD with "map" by finding the squares of all the numbers, and store the result as numMap
  • Finally, collect the squared numbers using "collect"

Spark functions needed:-

  • parallelize
  • map
  • collect
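
A minimal sketch, assuming the shell's sc; the five numbers are arbitrary:

In [ ]:
# One way to answer Question 2.
numbers = [1, 2, 3, 4, 5]             # any 5 numbers
numRDD = sc.parallelize(numbers)      # distribute the local list as an RDD
numMap = numRDD.map(lambda x: x * x)  # square every element
print(numMap.collect())               # bring the squared numbers back to the driver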

2. MapReduce

Question 3

  • Generate 10,000 random numbers and store them in a variable called "numbers"
  • parallelize the "numbers" and store the result in "numRDD"
  • Now filter the positive numbers from numRDD, and
  • Count the positive numbers

Spark functions needed:-

  • parallelize
  • filter
  • count
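
A possible sketch, assuming Python's random module as the source of random numbers; any random source works:

In [ ]:
# One way to answer Question 3.
import random
numbers = [random.randint(-1000, 1000) for _ in range(10000)]  # 10,000 random numbers
numRDD = sc.parallelize(numbers)
positives = numRDD.filter(lambda x: x > 0)  # keep only the positive values
print(positives.count())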

Question 4

  • Now use flatMap on "numRDD" to find the squares of all the numbers and store the result as "squaredRDD"
  • Try printing the first element of squaredRDD using the Spark action "first"
  • Try printing the first 10 elements of squaredRDD using the Spark action "take"

Spark functions needed:-

  • flatMap
  • first
  • take

Hint:- A flatMap-ed RDD might not behave like map or filter. Calling an action directly on such a flatMap-ed RDD will throw an error. Can you guess why? If you have already run into this and solved it, congrats!
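
A possible sketch, reusing numRDD from Question 3; the key detail is that flatMap expects the supplied function to return an iterable:

In [ ]:
# One way to answer Question 4.
# flatMap expects the function to return an iterable; returning a bare number
# (e.g. lambda x: x * x) makes the action fail, so wrap the result in a list.
squaredRDD = numRDD.flatMap(lambda x: [x * x])
print(squaredRDD.first())    # first element
print(squaredRDD.take(10))   # first 10 elements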

Question 5

  • There are two ways to express the logic inside a transformation: pass a named function, or use an inline (anonymous) function
  • Try learning both ways
  • Combine Questions 3 and 4: filter all negative values, square them, and print the first 20 of them.

Spark functions needed:-

  • filter
  • use either map or flatMap
  • take
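
A possible sketch showing both styles (a named function and an inline lambda), again reusing numRDD:

In [ ]:
# One way to answer Question 5.
def square(x):          # named function passed to map
    return x * x

negSquares = numRDD.filter(lambda x: x < 0).map(square)
# inline equivalent: numRDD.filter(lambda x: x < 0).map(lambda x: x * x)
print(negSquares.take(20))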

Question 6

  • Filter all positive values from numRDD into "filteredRDD"
  • Now reduce the filteredRDD, finding the sum of all the positive values
  • Also collect the filteredRDD and find the sum of the collected result
  • Compare the time taken by the native sum function on the collected data against Spark's reduce. Did you find any difference? No?
  • Okay, now generate a million (1,000,000) random numbers, store them in numRDD, and repeat the above four steps
  • If you still don't feel the significance, try increasing the count to 10 million and repeat the first four steps

Spark functions needed:-

  • reduce
  • collect
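
A possible sketch of the comparison, using Python's time module for rough timings (the Web UI mentioned in the hint below gives nicer numbers); start with 10,000 numbers and scale up as the question suggests:

In [ ]:
# One way to answer Question 6.
import random, time
numbers = [random.randint(-1000, 1000) for _ in range(10000)]  # raise to 1 or 10 million later
numRDD = sc.parallelize(numbers)
filteredRDD = numRDD.filter(lambda x: x > 0)

t0 = time.time()
sparkSum = filteredRDD.reduce(lambda a, b: a + b)  # summed by Spark
t1 = time.time()
nativeSum = sum(filteredRDD.collect())             # collected first, then summed on the driver
t2 = time.time()
print(sparkSum, t1 - t0, nativeSum, t2 - t1)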

Note:

If your machine has less than 8 GB of RAM or other heavy processes running, stop before 1 million. Try this when no other processes are running.

Hint:

  • There is an easy and graphical way to watch the time taken to complete the process.
  • Go to your browser, and type localhost:4040 and then hit Enter
  • What you see is Spark Web UI, we will cover this in detail during Cluster programming.
  • Did you notice all your earlier Spark actions listed there? :)

Question 7

  • This is an easy question
  • Create two RDDs, "num1" and "num2", with 5 numbers each
  • Find the "union" and "intersection" of the two RDDs
  • Told you it's easy :)

Spark functions needed:-

  • union
  • intersection
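
A possible sketch; the numbers are arbitrary, with some overlap so the intersection is non-empty:

In [ ]:
# One way to answer Question 7.
num1 = sc.parallelize([1, 2, 3, 4, 5])
num2 = sc.parallelize([4, 5, 6, 7, 8])
print(num1.union(num2).collect())         # all elements from both RDDs
print(num1.intersection(num2).collect())  # elements present in both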

Question 8

  • This is also a simple one :)
  • The RDD sets 1 and 2 are given below. Try all the Spark operations listed on each set (a sketch follows the two sets).

Spark functions needed:-

  • join
  • leftOuterJoin
  • rightOuterJoin
  • cartesian

RDD set 1:-


In [ ]:
# combining with just values
rdd1 =  sc.parallelize([("foo", 1), ("bar", 2), ("baz", 3)])
rdd2 =  sc.parallelize([("foo", 4), ("bar", 5), ("bar", 6)])

RDD set 2:-


In [ ]:
# combining with just list items
words1 = sc.parallelize(["Hello", "Human"])
words2 = sc.parallelize(["world", "all", "you", "Mars"])
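
A possible sketch of the listed operations. The three joins expect key/value pairs, so they fit RDD set 1; cartesian needs no keys and works on either set:

In [ ]:
# One way to answer Question 8.
print(rdd1.join(rdd2).collect())            # inner join on matching keys
print(rdd1.leftOuterJoin(rdd2).collect())   # keeps every key from rdd1
print(rdd1.rightOuterJoin(rdd2).collect())  # keeps every key from rdd2
print(words1.cartesian(words2).collect())   # every pairing across the two word RDDs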

Question 9

  • Let's count the words in all the works of William Shakespeare!
  • Load the 'textFile' named 'pg100.txt' from the resources folder into poemRDD
  • Chain all of the following operations on poemRDD:-
    • flatMap poemRDD and split each line into words, using a space (' ') as the delimiter
    • map each word into a tuple with two values: (word, 1)
    • Now reduceByKey and add up all the values. Here the key is the word
    • Collect the result

Spark functions needed:-

  • flatMap
  • map
  • reduceByKey
  • collect
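
A possible sketch of the chained word count, assuming the shell's sc:

In [ ]:
# One way to answer Question 9.
poemRDD = sc.textFile('resources/pg100.txt')
wordCounts = (poemRDD
              .flatMap(lambda line: line.split(' '))  # one element per word
              .map(lambda word: (word, 1))            # (word, 1) pairs
              .reduceByKey(lambda a, b: a + b)        # sum the 1s per word
              .collect())
print(wordCounts[:10])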

Self Exercise:-

Think about displaying the words with the highest counts in descending order.

3. Config

Question 10

  • Try to find all the configuration settings of the default shell SparkContext. Which function should you use?
  • To create a new config, stop the running SparkContext first
  • Now create a new SparkConf object with an app name and a master
  • Check whether "spark.dynamicAllocation.enabled" is true; if it is false, execute the next step
  • Set "spark.dynamicAllocation.enabled" to "true" in the conf
  • Pass the newly created conf as a parameter to the SparkContext constructor and assign the new SparkContext to sc (a sketch follows the hints below)

Hint:-

For finding the existing shell sc configuration,

http://stackoverflow.com/questions/30560241/is-it-possible-to-get-the-current-spark-context-settings-in-pyspark

Finding "spark.dynamicAllocation.enabled" status,

https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-dynamic-allocation.html

For setting dynamicAllocation,

https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-SparkConf.html
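
A possible Python sketch; the app name and master value are only examples, and in local mode the dynamic allocation setting is merely recorded in the conf rather than actually used:

In [ ]:
# One way to answer Question 10.
from pyspark import SparkConf, SparkContext

print(sc.getConf().getAll())   # configuration of the current shell context
sc.stop()                      # stop the running SparkContext first

conf = SparkConf().setAppName('week2-exercise').setMaster('local[*]')
if conf.get('spark.dynamicAllocation.enabled', 'false') != 'true':
    conf.set('spark.dynamicAllocation.enabled', 'true')
sc = SparkContext(conf=conf)   # new context built from the new conf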

Note:-

So, did you get curious about why we set dynamic allocation in the config? Have you found out where it is useful?

4. Partitions

Question 11

  • Now, shall we try partitions with Question 6?
  • The parallelize function takes two parameters: the first is the data object, the second is 'numSlices'
  • Set the numSlices value to at most your number of CPU cores
  • Now observe the time difference when executing steps 1 to 4 of Question 6
  • If feasible, try with 1 to 10 million random numbers
  • Can you see that finding the optimum 'numSlices' value is itself an optimization problem?
  • Try using 'getNumPartitions' on any of the RDDs before and after increasing the number of partitions

Spark functions needed:-

  • parallelize
  • reduce
  • collect
  • getNumPartitions
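
A possible sketch; numSlices=4 is only an example, pick a value no larger than your core count:

In [ ]:
# One way to answer Question 11.
import random
numbers = [random.randint(-1000, 1000) for _ in range(1000000)]
numRDD = sc.parallelize(numbers, numSlices=4)  # the second parameter controls the partition count
print(numRDD.getNumPartitions())               # verify it
print(numRDD.filter(lambda x: x > 0).reduce(lambda a, b: a + b))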

Question 12

  • Another easy one; this is just a copy/paste of Question 9
  • Just provide an additional argument 'minPartitions' to the textFile method and observe the time difference in evaluation
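
A possible sketch; minPartitions=8 is only an example value:

In [ ]:
# One way to answer Question 12.
poemRDD = sc.textFile('resources/pg100.txt', minPartitions=8)
print(poemRDD.getNumPartitions())
wordCounts = (poemRDD.flatMap(lambda line: line.split(' '))
                     .map(lambda word: (word, 1))
                     .reduceByKey(lambda a, b: a + b)
                     .collect())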

5. Spark SQL

Spark Session

  • We are going to create a Spark Session. Why is this different from SparkContext and SparkConf?
  • Store the Spark Session object in a variable called 'spark' (a sketch follows the hint below)

Hint :

https://docs.databricks.com/spark/latest/gentle-introduction/sparksession.html
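
A possible sketch; note that recent pyspark shells already expose a session as spark, in which case getOrCreate() simply returns it. The app name is only an example:

In [ ]:
# Creating (or reusing) a SparkSession.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName('week2-sql')
         .getOrCreate())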

Question 13

  • Loading data from JSON
  • Using the 'spark' variable, read 'people.json' from the resources folder and store it in a variable called 'df'
  • 'df' is a DataFrame; view its contents using the "show" method
  • Use the 'select' function on 'df' to choose a single column and display it using the 'show' method
  • Use the 'filter' function on 'df' to apply a condition like df['age'] > 18 and display the results using 'show'
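
A possible sketch, assuming the file is people.json and that it contains name and age columns (as in the standard Spark example data):

In [ ]:
# One way to answer Question 13.
df = spark.read.json('resources/people.json')
df.show()                         # print the DataFrame contents
df.select('name').show()          # a single column
df.filter(df['age'] > 18).show()  # rows where age is greater than 18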

Self Exercise:-

  • Learn to create Temp View
  • Learn to create Global Temp View
  • Find the difference between a global temporary view and a temporary view, and check it with an example.
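
A possible sketch for the self exercise; the view names are arbitrary. Global temporary views live in the global_temp database and are shared across sessions of the same application, while plain temporary views are tied to the session that created them:

In [ ]:
# Temporary view vs. global temporary view.
df.createOrReplaceTempView('people')          # visible only in this SparkSession
df.createGlobalTempView('people_global')      # shared across sessions of the same application
spark.sql('SELECT * FROM people').show()
spark.sql('SELECT * FROM global_temp.people_global').show()
spark.newSession().sql('SELECT * FROM global_temp.people_global').show()  # reachable from a new session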

For More SQL Exercises Refer,

For Python, https://github.com/apache/spark/tree/master/examples/src/main/python/sql

For Scala, https://github.com/apache/spark/tree/master/examples/src/main/scala/org/apache/spark/examples/sql


Question 14 - TODO Week 3

  • Loading data from text files